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SUMMARY 

Ada 1 packages implementing selected mathematical functions for the support of scientific and 
engineering applications have been written. The packages provide the Ada programmer with the 
mathematical function support found in the languages Pascal and Fortran as well as an extended 
precision arithmetic and a complete complex arithmetic. The algorithms used are fully described 
and analyzed. Implementation assumes that the Ada type FLOAT objects fully conform to the 
IEEE 754-1985 standard for single binary floating-point arithmetic, and that INTEGER objects are 
32-bit entities. Codes for the Ada packages are included as appendixes. 

INTRODUCTION 

The packages described in this report were developed as infrastructure for benchmark programs 
written to test a new computer. They provide selected mathematical functions, elementary 
transcendental functions, complex arithmetic, and a limited facility for extended precision 
arithmetic. Other packages for extended precision arithmetic were written to support these 
functions. The implementation of the functions, coded in Ada [1], assume IEEE 754-1985 [2] single 
binary floating point arithmetic and 32-bit integers are supported at the hardware level. Project 
design goals were (1) a portable, robust self contained library in Ada, (2) support for 
scientific/engineering computation, and (3) accurate, efficiently produced results. 

Every means was used to ensure the accuracy and correctness of the functions. Additionally, the 
algorithms were coded and tested for actual performance. The packages coded are given as 
appendixes. Individual packages implementing the selected functions are provided as 
appendixes A through D of this report. The remainder of the report discusses, in detail, the 
implementation of these functions. 


FLOATING-POINT ARITHMETIC 

The IEEE 754-1985 standard is a specification of arithmetic formats, operations, and details 
concerned with the accuracy of results and the treatment of exceptional conditions. For this paper, 
the relevant specification is single binary floating-point format, which is a 1-bit sign field, an 
8-bit biased exponent field e, 0 £ e £ 255, and a 23-bit fraction field, f = O.bjhj.. b ffl . The 32-bit single 
binary float formatted number X is stored as 
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23 
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Its value is decoded from its stored representation by means of the following rules: 


e 

f 

Interpretation 

255 

0 

(-1) * °° 

255 

* 0 

NaN, regardless of s 

0 

0 

(-1)* 0 (zero) 

0 

* 0 

(-1)* 2‘ 126 O.f (denormalized number) 

>0, <255 


(-1) * 2** lZ7 l.f 


For numerical algorithms, the major consequences of the format are 

• accuracy is 24 bits, at most 

• the smallest representable number e, such that (1 + e) - 1 > 0, in floating- 
point arithmetic is e = 2 a = 1.192 ' x 10' 7 (therefore, approximations need 
barely more than seven digits of relative accuracy). 

• the range of normalized floating-point numbers is 2'™ (« 1.175494211 10 to 
2 127 (2-2 a ) (- 1.701411632 10 s ) . 

Some Basic Procedures 

While the elementary transcendental functions are all well behaved, their Taylor series are 
unsatisfactory for numerical calculation. To bound the time necessary for a given accuracy, the 
range over which an approximation is based must be limited. Range reduction requires either 
separating the fields of a floating-point number or arithmetic manipulation. In Ada, field 
separation is most quickly done by instantiating the generic function 
unchecked_conversion available in the Ada package system [1], Converting a FLOAT 
number to INTEGER without changing the bit pattern, and separation is easily done in integer 
arithmetic. Conversely, from s, e, and f, an INTEGER with the appropriate bit pattern can be 
constructed and then converted to a FLOAT number by instantiation of the same generic function. 

These fundamental procedures are given as package IEEE754Lib in appendix A. Any floating 
arithmetic requires similar routines. Machine language code would be more efficient, but 
portability would be sacrificed. 

EXTENDED ARITHMETIC PACKAGE 

Digital computer floating-point arithmetic is grainy with gaps between numbers. For example, in 
IEEE 754-1985 arithmetic, the first number after 2 s is 2 B +2. Consequently, results of arithmetic 
operations may not be exactly expressible. Error also occurs from the truncation of constants. For 
example, the constant Jt is expressible only by an infinite sequence of bits, but with IEEE 754-1985 
arithmetic, the nearest value is about 3.1415927. A final source of error arises when two nearly 
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equal numbers are subtracted. Fewer bits remain in the difference. Sometimes this is called 
catastrophic cancellation, but the term is misleading, because the result is exact and is a natural 
consequence of finite representation. 

Examples of how range reduction can lead to the last kind of error are the 2n periodic functions 
sine and cosine. An argument is reduced to an equivalent value in [- n/4, n/4). For a large 
argument, many bits encode which interval of length n/2 contains it. For such arguments, 
accuracy can be maintained only with extended values for constants and extra precise arithmetic 
for this part of the calculation. 

Linnainmaa [3] describes a portable, doubled precision arithmetic based on the native floating- 
point arithmetic. The algorithms depend on the assumption that native floating-point arithmetic 
is either correctly rounded or correctly chopped. An extended precision number is represented as 
a pair of single precision numbers (x_high, x_low) which, in the native floating-point 
arithmetic, exactly satisfy 

x_high “ x_high + x_low 

floating-point numbers are separated into pieces, each of which has sufficiently few bits that the 
result of an arithmetic operation with these numbers is always exactly represented. The 
separation process for numbers reduces the dynamic range of numbers somewhat, but this is not a 
problem in the library packages. 

Error in extended arithmetic is complicated by truncation, modes of rounding, and extra bits in 
intermediate results. However, for IEEE 754-1985 single binary float arithmetic with correct 
rounding, extended multiply is accurate to 47 bits, extended divide to 46.5 bits, and extended add to 
48 bits. This is more than sufficient for the library packages. An Ada implementation of this 
extended arithmetic is appendix B. Quite clearly, this extended arithmetic is applicable to far 
more than just the libraries discussed here. Furthermore, the same ideas can be used to extend any 
higher precision arithmetic available as well, including the extended precision itself. However, 
this extension to multiple precision eventually fails because full double precision is unattainable. 

COMPLEX ARITHMETIC PACKAGE 

This package carefully implements Cartesian complex arithmetic. A type complex (a record of 
two numbers of type FLOAT), is declared with functions for manipulating these objects. The 
arithmetic operators are overloaded. Facilities for creating and mixing these new numbers with 
numbers of type FLOAT are included. Also included are a complex absolute value function and 
a complex square root function. Special care has been taken to ensure accuracy and absence of 
false errors. The package, named ComplexArithmetic, is given as appendix C. 
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THE MATHLIB PACKAGE 


The Basic Mathematical Functions Included in the package are 


Name 

Argument List 

Value Type 

Function 

floor 
ceil 
t rune 
xmod 

x : float 

x : float 

x : float 

x, y : float 

integer 

integer 

integer 

float 

nearest integer <= x 
nearest integer >= x 
round x toward 0 to integer 
(x - trunc(x / y) * y) 

urand 

urandinit 

sl, s2 integer 

float 

random numbers in (0,1) 
initialize urand 


The first four are simple conversion and manipulation functions. The last two provide a 
reinitializable 32-bit, portable, uniform random number generator with an extremely long period 
(about 10 B ) and very good statistical properties (see [4]). The random number generator depends 
heavily upon 32-bit integers for efficient implementation. 


Transcendental functions Included in the package are 


Name 

Function 

Range 

sin(x) 

trigonometric sin* of x (radians) 

|X| < 2'* 

cos(x) 

trigonometric cosine of x (radians) 

|x| < 2 U 

•xp (x) 

constant e raised to the power x 

x < 88.02969 

log(x) 

natural logarithm of x 

x > 0.0 

sqrt(x) 

square root of x 

x £ 0.0 

atan(x) 

arc tangent of x, radians 

all float numbers 

atan2 (x,y) 

proper quadrant arctangent (x/y) 

all float numbers 


This small set of functions satisfies most needs because these functions form the building blocks 
for other functions and are the basis for many scientific/engineering calculations. Accuracy has 
been emphasized over efficiency. All the approximations, except the square root, are rational 
approximations over intervals. They are optimal in the sense of minimizing the maximum error 
over the interval of approximation. Errors because of the basic approximations are slightly 
increased because some exact constants are preserved and the approximations are accurate on an 
interval slightly larger than the interval used. 


The latter is of practical importance because the actual argument is reduced to the interval of 
approximation. Even with careful range reduction, the reduced argument may fall outside the 
interval of approximation because of arithmetic error in the reduction phase. All these 
approximations are found in reference [5]. 
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Individually, the basic approximations are accurate when just single precision binary floating- 
point arithmetic is used. When necessary, range reduction is performed with extended precision 
arithmetic and extra precise constants. The implementations avoid false errors, i.e. those not 
attributable to improper arguments. For out-of-range arguments an exception, domain_error, is 
declared. Whenever an out-of-range argument is passed, an exception is raised and control 
returns to the calling module. 

Algorithms- The algorithms and error analysis for the transcendental function approximations 
are given and analyzed in this section. In addition to the usual theoretical error bounds, the 
results from actual use of the implementations based on the approximations are included. 

sqrt- This algorithm uses the standard Newton’s iteration method with a starting approximation 
that guarantees accuracy for all arguments at the end of three iterations. 

If x = 0, then sqrt(x) = 0. Otherwise, let the argument be x = 2* l.f. Set 

yo * M®«5 * l.f + C m ) 2*^ 2 , if • is even 

1(0.25 * l.f + C 0 ) 2 ( *" 1)/2 , if • is odd 
where C, ■ l , C Q * 4 C« - 1 

VvTT V. * 

and then 

Yl ■ 0.5 (y 0 + x/y 0 ) 

Y2 * 0.5 (y 1 + x/y x ) 

y 3 » y 2 - 0.5 (y 2 - x/y 2 ) 

where y 3 is the approximate square root. 

Error Analysis- If y is an approximation to the square root of x, and we define 8 by 

Y ’ *(hH) 

then 



[6] shows that 8(yo) is accurate to more than 4.79 bits. Thus, y„ y* and y s are all larger than the true 
square root and satisfy the above equation with values of 8 smaller than 2 W , 2 s2 , and 2‘ 45 , 
respectively. Hence, the successive relative errors are 0.075, 0.00262, 3.41xl0' s , and 5.8xl0 -12 . The 
final iteration is a small correction to an already accurate value. 

sin-cos- First, the range is reduced to the interval [-jt/4, n/4] using 

x ■ (4n + j) n/2 + y, lyl Sre/4 


and the relations 
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/ sin (y) , 

if 

j sa 0 


cos (y) , 

if 

j 

= 0 

sln(x) * i 

' cos (y) , 

if 

j - i 

cos (x) = ( 

-sin(y) , 

if 

j 

= 1 


-sin(y) , 

if 

j - 2 


-cos (y) , 

if 

j 

= 2 


-cos (y) , 

if 

j * 3 


i sin(y), 

if 

j 

= 3 


The approximations for sin(y) and cos(y) in the reduced range are 


sin(y) - y + y’ (-0.16666653 + y (0.0083320645 - 0.00019502220 y ) ) 

cos (y) - 1 + y J (-0.5 + y 1 (0.041666644 + y 1 (0 .000024423102y J - 0.0013887229))) 

These approximations are correct to more than 27 and 32 bits, respectively. 

Error Analysis- In the sine approximation, rounding error occurs in squaring and cubing y 
which can generate 1 and 2 bits of error respectively, and truncating coefficients. However, the 
number added to y never exceeds 0.1 in relative size, so more than 3 bits are shifted off the end, 
leaving, at most, a single rounding error. Actual accuracy for the sine using rounded arithmetic 
is more than 23.9 bits. 

In the cosine approximation, rounding errors occur in squaring y and the evaluation of the 
polynomial added to 1. When rounded arithmetic is used the actual accuracy is almost 23.6 bits for 
the cosine. 

Error in the range reduction step occurs as a result of the less than full double precision of a 
result. Since the error is limited to less than 2 bits, no error occurs until the argument to the 
sine/cosine routine exceeds 2 a . Thus, only in very unusual circumstances will this be a concern. 

exp- The exponential function is written 

exp(x) ■ 2 n exp (y) 

Y ■ x - n log. (2) , lyl £ log. (2)/2 


axp(y) ■ 1 + y + 


The multiplication by 2” is exact in binary arithmetic. In the basic interval, a rational 
approximation 

y [y - y 2 Ply 2 )] 

2 - y - [y 2 Ply 2 )] 
is used. Here, the final addition yields an error of at most 0.5 bits. This form preserves weak 
monotonicity for negative arguments. That is, the strong monotonicity property of the original 
function x < y implies exp(x) < exp (y ) is preserved only in the form x < y implies 
exp (x) £ exp (y) in the approximation. This is the best that can be expected from floating- 
point arithmetic. Some applications require preservation of monotonicity. For IEEE 754-1985 
single binary float arithmetic, the polynomial is 
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0.16666612 


0.0027652702 x 


P(x) - 

and the actual accuracy using rounded arithmetic is better than 23.8 bits. 

log- For any positive x, the natural logarithm can be defined from its values over a limited 
domain as 

log tt (x) - n log a (2) + log® (y) , Vo.5 < y < V2 


Using binary arithmetic, n and y are obtained exactly from the floating-point representation of x. 
The approximation for log(y) in the reduced domain is 


log (y) 


(y 



Y - 1 

y + 1 


In this form, the multiplier of any error in v can be as large as 2.0. To reduce this error to at most 
0.395, the identity 


2v - (y - l) + v(i - y) 


is used to convert the approximation to the form 

log (y) ■ (y - i) + ▼ 




The quantity n log(2) is calculated using an extended precision constant. So, if 

n log* (2) - (A, B) 

the value of the approximate logarithm is calculated by summing, from right to left, the terms i n 


where 


log (x) 


A + (y - 1) + B + v 



Q (*) - 1.5000459 - 0.90463380 z 


is accurate to more than 25 bits. With IEEE 754-1985 rounded arithmetic, the actual relative 
accuracy is greater then 23.4 bits in the reduced interval, and more than 24 bits otherwise. 


atari- For any x, the arc tangent function can be stably evaluated using the continued fraction 


atan (x) 


x x2 4x2 9x2 fc*) 2 

1+ 3 + 5 + 7+ (2k + l) + 
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However, this is not economical because the amount of work necessary to maintain a given level 
of accuracy grows unboundedly with the argument. To keep the amount of work small, the range 
of x is reduced using 

atan(z) « sgn(z) n/2 - atan(l/x) 

for I x I >1, and further reducing the range with 


atan (z) 


sgn(z) 


— + atan(y) 
6 


|x|V3~ - 1 
1*1 + f3 


Hi 1 ?) 


< i*i < i 


Compounding of induced errors is avoided with the elaborated approximation 


atan (z) 


atan (x) 


r 1*1 

< Hfr) - 

agn(z) 

jj- + »tan (y) 

<■ y ■ 

I*|V3 - 1 
1*1 + V3 ' 

agn (z) 

F- - »tan (y) 

r y “ 

VT - i z i 


13 


1 + |z|>3 

sgn (z) 

-®- - atan (y) 
2 

r y - 

1/*, z > 

ta 


-fc) 


< i*i < i 


, i < i*i < 


tanU-l 

\12) 


■te) 


The basic approximating form is 

atan (y) - y - y*/Q(y*) 

where Q ( z ) is a polynomial. The polynomial 


Q(z) * 3.0000071 + z (1.7994048 - 0.19159707 z) 

gives results that are accurate to more than 28 bits. When rounded arithmetic is used, the 
approximation is correct to 24, 22.8, 23.3, and 23.56 bits in the four ranges. Extended precision 
helps preserve accuracy in forming the numerator in the second range. 


atan2- The proper quadrant arc tangent function is defined, using the atan function, by 

( sgn(z) K 1 2 , if y = 0 
atan(z/y), if y > 0 
sgn (z) n + atan (z/y) , if y < 0 

For this function there are two errors beyond those in atan. The first is the error from forming 
z / y. The second is from the addition of re . The second can be avoided with an extended precision 
value of it. But the first induces an error into the argument of the atan function which cannot be 
avoided or corrected. The relative error is less than 0.5 bits and this is reflected in an increased 
error not exceeding 0.5 bits. 

The entire mathematics package, MathLib, is given in appendix D. 
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Appendix A. Low Level Routines - Package lEEE754Lib 


package IEEE754Lib is 


type fpcomponents is 
record 

s : integer 0..1; 
e : integer -128 . . 127; 
f : integer 0 .. 8#77777777#; 
end record; 


procedure Strip (x : in float; r :out fpcomponents) ; 
function Assemble ( r : in fpcomponents) return float; 
end IEEE754Lib; 


with unchecked_conversion; 
package body IEEE754Lib is 

t23 : constant integer 2**23; 

x_as_integer, 

local_e, 

local_f : integer; 

function f loat_to_integer is new unchecked_conversion (float, integer) ; 
function integer__to_ f loat is new unchecked_conversion (integer, float ) ; 
procedure Strip (x : in float; r : out fpcomponents) is 
begin 

if x < 0.0 then r.s :« 1; else r.s : m 0; end if; — sign field 
x_as_integer :• f loat_to_integer (abs x); 
local^e x_as_integer / t23; — exponent field 

local_f :« x_as_integer mod t23; — fractional field 
if local_e >0 then 

r.e local_e - 127; 
r.f :» local__f + t23; 
else 

r.e -128; 

r.f :« local_f; 
end if; 
end Strip; 


function Assemble ( r : in fpcomponents) return float is 
begin 

if r.e - -128 then 

x_as_integer r.f; 
else 

*_as_integer :« (r.e + 127) * t23 + (r.f - t23) ; 
end if; 

if r.s - 1 then 

return -integer_to_f loat (x_as_integer) ; 
else 

return integer_to_f loat (x_as_integer) ; 
end if; 
end Assemble; 
end IEEE754Lib; 
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Appendix B. The Extended Arithmetic Package - ExtendedArithmetic 


package ExtendedArithmetic is 

— a portable extended arithmetic that yields slightly dirty double accuracy. 

— It was not implemented using a new data type and overloaded operations, 

— because it is not intended for extensive use. 

— A doubly precise numbers is an ordered pair of type float 

— numbers (x,xx) with the property: x«fl (x+xx) exactly. 


— The procedures are: 


— 

Emult 

(a. 

c. 

x, 

XX ) 



— 

Empy 

(a, 

aa, 

c. 

cc. 

X, 

xx) 

— 

Ediv 

(a. 

aa. 

C, 

cc, 

X, 

xx) 

— 

Eadd 

(a. 

aa, 

c , 

cc. 

X, 

xx) 


— all arguments are floating-point 


(x,xx) * a * c exactly 

(x,xx) « (a,aa) times (c,cc) 

(x,xx) - (a,aa) divided by (c,cc) 

(x,xx) « (a,aa) plus (c,cc) 

numbers . 


procedure Emult (a, c : in float; x, xx : out float); 

procedure Empy(a, aa, c, cc : in float; x, xx : out float); 

procedure Eadd(a, aa, c, cc : in float; x , xx : out float); 

procedure Ediv(a, aa, c, cc : in float; x, xx : out float); 

end ExtendedArithmetic; 


package body ExtendedArithmetic is 
— Local variables 

q* qq, 

Z, 2Z, 

al, a2, 
cl, c2, 
c21, c22 , 
u, 
t ¥ 

su : float; 

w : constant float :« 4097.0; — IEEE 754 specific 


procedure Emult(a, c : in float; x, xx : out float) is 
begin 

t :» a * w; 

al :« (a - t) + t; 

a2 :■ a - al; 

t :« c * w; 

cl :* (c - t) + t; 

c2 :- c - cl; 

t :■ c2 * w; 

c21 (c2 - t) + t; 

c22 :■ c2 - c21; 

u a * c; 

x : ■ u ; 

xx:- ((((al*cl - u) + al * c2) + cl * a2) + c21 * a2) + c22 * a2; 
end Emult; 
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procedure Empy{a, aa, c, cc : in float; x, xx : out float) is 
begin 

Emult<a, c , z, q) ; 
zz (a * cc + c * aa) + q; 
u :■ z + zz; 
x : - u ; 

xx (z - u) + zz; 
end Empy; 


procedure Eadd(a, aa, c, cc : in float; x, xx : out float) is 
begin 

z :« a + c; 
q a - z; 

zz :» (((q 4* c) + (a - (q + z) ) ) + aa) + cc; 
u z + zz; 
x : ■ u ; 

xx (z - u) + zz; 
end Eadd; 


procedure Ediv(a, aa, c, cc : in float; x, xx : out float) is 
begin 

z :» a / c; 

Emult (c, z, q, qq) ; 

zz ((((a - q) - qq) + aa) - z * cc) / c; 
u :* z + zz; 
x : - u ; 

xx :« (z - u) + zz; 
end Ediv; 


end Ex tendedAri thine tic; 
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Appendix C. The Complex Arithmetic Package - ComplexArithmetic 


— This package provides a complex arithmetic overloading the usual 

— arithmetic operators with its own. The new operators work on objects 

— of type complex, which is a pair of float numbers, optionally mixed with 
float numbers. 

— Emphasis has been placed upon accuracy of the results. 

— A complex square root is included. 

package ComplexArithmetic is 

— new data type 

type complex is record 
real, 

imag : float; 
end record; 

Arithmetic operations 

function ”+" (x, y : complex) return complex; 
function "+" (x : float; y : complex) return complex; 

function "+” (x : complex; y : float) return complex; 

function (x, y : complex) return complex; 
function (x : float; y : complex) return complex; 

function (x : complex; y : float) return complex; 

function ”-”{x : complex) return complex; 
function ”* n (x, y : complex) return complex; 
function (x : float; y : complex) return complex; 

function (x : complex; y: float) return complex; 

function "/" (x, y : complex) return complex; 
function "/”{x : float; y : complex) return complex; 

function n / n (x : complex; y : float) return complex; 

— manipulation of complex numbers 

— real part 

function realpart (x : complex) return float; 

— imaginary part 

function imagpart(x : complex) return float; 

— form a complex number from its piees 
function cmplx(x, y : float) return complex; 

— complex conjugate 

function conjg(x : complex) return complex; 

— complex length 

function cabs (x : complex) return float; 

— Square root 

function csqrt (x : complex) return complex; 
end ComplexArithmetic; 
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with Mathlib; use Mathlib; 
package body ComplexArithmetic is 

function "+" (x, y : complex) return complex is 
begin 

return (x.real + y.real, x.imag + y.imag); 
end; 


function "+" (x : float; y : complex) return complex is 
begin 

return (x + y.real, y.imag); 
end; 


function ”+" (x : complex; y : float) return complex is 
begin 

return (x.real + y, x.imag); 
end; 


function (x, y : complex) return complex is 
begin 

return (x.real - y.real, x.imag - y.imag); 
end; 


function "-"(x : float; y : complex) return complex is 
begin 

return (x - y.real, -y.imag); 
end; 


function (x : complex; y : float) return complex is 
begin 

return (x.real - y, x.imag); 
end; 


function ”-"(x : complex) return complex is 
begin 

return (-realpart (x) , -imagpart (x) ) ; 

end 


function ”*"(x, y : complex) return complex is 
begin 

return (x.real * y.real - x.imag * y.imag, 
x.real * y.imag + x.imag * y.real); 

end; 


function ™*"(x : float; y : complex) return complex is 
begin 

return (x * y.real, x * y.imag); 
end; 


function (x : complex; y: float) return complex is 
begin 

return (x.real * y, x.imag * y) ; 
end; 
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function "/" (x, y : complex) return complex is 
denom, prodef : float; 
begin 

if y.real * 0.0 and y.imag - 0.0 then 
raise Numeric_Error; 
end if; 

if abs y.real >* abs y.imag then 
prodef :« y.imag / y.real; 
denom 1.0 / (y.real + y.imag * prodef); 
return (denom * (x.real + x.imag * prodef), 

denom * (x.imag - x.real * prodef)); 

else 

prodef :* y . real/y . imag; 

denom :« 1.0 / (y.imag + y.real * prodef); 
return (denom * (x.imag + x.real * prodef), 

denom * (x.imag * prodef - x.real)); 

end if; 
end; 


function "/"(x : float; y : complex) return complex is 
denom, prodef : float; 
begin 

if y.real - 0.0 and y.imag - 0.0 then 
raise Numeric_Error; 
end if; 

if abs y.real >- abs y.imag then 
prodef y.imag / y.real; 
denom :■ x / (y.real + y.imag * prodef); 
return (denom, -denom * prodef) ; 
else 

prodef :* y. real/y . imag; 
denom :■* x / (y.imag + y.real * prodef); 
return (denom * prodef, -denom) ; 
end if; 
end; 


function "/" (x : complex; y : float) return complex is 
begin 

return x * (1.0 / y); 
end; 


function cabs (x : complex) return float is 
y : float; 
begin 

if x.real » 0.0 and x.imag * 0.0 then 
return 0.0; 
end if; 

if abs x.real >- abs x.imag then 

return (abs x.real) * sqrt ( (x.imag / x.real) ** 2 + 1.0); 
else 

return (abs x.imag) * sqrt ((x.real / x.imag) ** 2 + 1.0); 
end if; 
end cabs; 
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function realpart (x : complex) return float is 


begin 

return x.real; 
end; 


function imagpart (x : complex) return float is 
begin 

return x.imag; 
end; 


function cmplx(x, y : float) return complex is 
begin 

return (x, y) ; 
end; 


function conjg(x : complex) return complex is 
begin 

return (x . real, -x . imag) ; 
end; 


function csqrt (x : complex) return complex is 

q : float; 
begin 

if x.imag - 0.0 then 

if x.real >-0.0 then 

return (sqrt (x. real) , 0.0); 
else 

return (0.0, sqrt (-x . real) ) ; 
end if; 
else 

q sqrt (0.5 * (abs (x.real) + cabs (x) ) ) ; 
if x.real >- 0.0 then 

return (q, 0.5 * x.imag / q) ; 
else 

return (-0.5 * x.imag / q, -q) ; 
end if; 
end if; 
end csqrt; 


end ComplexArithmetic; 
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Appendix D. Hie Mathematical Function Package - MathLib 


— Ada does not come with a mathematical function package of any kind, so 

— this one is provided for the most common elementary and transcendental 

— functions. 

— IEEE 754-1985 single binary float format is assumed. 

— A single exception is defined in the package. This exception, named 

— domain_error, is raised whenever an illegal argument is provided to one of 

— the routines. 

package MathLib is 


domain_error 



’exception; 



function 

sin 

(X 


float) 

return 

float; 

— sine of x 

function 

cos 

(X 


float) 

return 

float; 

— cosine of x 

function 

exp 

(X 


float) 

return 

float; 

— e to the power x 

function 

log 

(X 


float) 

return 

float; 

— natural logarithm of x 

function 

sqrt 

(X 


float) 

return 

float; 

— square root of x 

function 

atan 

(X 


float) 

return 

float; 

— arc tangent of x 

function 

atan2 (x, 

y 

float) 

return 

float; 

— proper quadrant atan(x/y) 

function 

xmod 

(X, 

y 

float) 

return 

float; 

— (x - trunc (x / y) * y) 

function 

floor (x 


float) 

return 

integer; 

— nearest integer <= x 

function 

ceil 

(X 


float) 

return 

integer; 

— nearest integer >= x 

function 

trunc (x 


float) 

return 

integer; 

— nearest integer in [0, x) 

function 

urand 

return float; 

— uniform random numbers in {0, 1) 


procedure urandinit (seedl, seed2 : in integer); — reinitialize urand 
end MathLib; 


with IEEE754Lib, ExtendedArithmetic; Use IEEE754Lib, ExtendedArithmet ic ; 
package body MathLib is 


urand default 
s2 : intege 
si : intege 
atan constants 
tanpiby!2 : 
sqrt3 : 

sqrt31o : 
thirdpi : 
sixthpi : 
AtanR : 


: fpcomponents; 
seeds 

r 41569; 

r 46013; 

constant float 
constant float :* 
constant float 
constant float :* 
constant float 
constant array {0 


0.26794919; 
Assemble < (0, 0, 

Assemble ((0, -25, 
1.0471976; 
0.52359878; 

. . 2) of float :« 


8#67331727#) ) ; 
8#41302312#) ) ; 


( -0.19159707, 
1.7994048 , 
3.0000072 ) ; 


sine-cosine constants 
FourthPi : constant 
HalfPi : constant 
HalfPiLo : constant 
sine : constant 


cosine : constant 


float :» 0.78539816; 
float :* Assemble ((0, 
float :* Assemble{(0, 
array (0 . . 2) of float 


array (0 . . 2) of float 


0, 8#62207732#) ) ; 
-24, 8#50420551#) ) ; 

(-0 . 19502220e-3, 

0 . 83320645e-2, 
-0.16666653 ); 

( 0 ,24423102e-4, 
-0 . 13887229e-2, 

0 . 41666644e-l) ; 
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-- exp constants 


ln2hi : 

constant 

float 

:» 

Assemble ( (0, 

-lr 

8#54271027#) ) ; 

ln21o : 

constant 

float 

; s 

Assemble ((0, - 

25, 

8#75750717#) ) ; 

half ln2 : 

constant 

float 


0.34657359; 



domain : 

constant 

float 

l m 

87.6831092; 



ExpR : 

constant 

array 

(0 

. . 1) of float 

:« 

(-0 . 27652702e-2 







0.16666612 

— log constants 






sqrt2 : 

constant 

float 


Assemble ( (0, 

0, 

8#55202363#) ) ; 

LogR : 

constant 

array 

(0 

. . 1) of float 

: - 

( -0.90463380, 







1.5000459) ; 

— sqrt constants 






Ceven : 

constant 

float 

:» 

0.48260052; 



Codd : 

constant 

float 


0.93040208; 



function 

CM 

>i 

>! 

tr 

float) 

return float is 




begin 

return y- (y*y2) / ( ( (AtanR(O) *y2+AtanR(l) ) *y2) +AtanR (2) ) ; 

end q; 


— uniform random number generator based upon 
combined linear congruential generators 
procedure urandinit (seedl, seed2 : in integer) is 
begin 

si :« seedl; 
s2 seed2; 
if si <- 0 then 

si :» si + 2147483563; 
if si <- 0 then 

si si + 2147483563; 
end if; 

end if; 

if s2 <~ 0 then 

s2 s2 + 2147483399; 
if s2 o 0 then 

s2 :« s2 + 2147483399; 
end if; 
end if; 

if si > 2147483563 then 
si :« si - 2147483563; 
end if; 

if s2 > 2147483399 then 
s2 s2 - 2147483399; 
end if; 

end urandinit; 
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function urand return float is 
z, k : integer; 
begin 

k :» si / 53668; 

si :« 40014 * (si - k * 53668) - k * 12211; 
if si < 0 then 

si :« si + 2147483563; 
end if; 

k :« s2 / 52774; 

s2 40692 * (s2 - k * 52774) - k * 3791; 
if s2 < 0 then 

s2 := s2 + 2147483399; 
end if; 
z :« si - s2; 
if z < 1 then 

z : m z + 2147483562; 
end if; 

return float (z) * 4 . 656613e-10 ; 
end urand; 


function floor (x : float) return integer is 
y : float :« abs x; 
fix : integer integer(y); 
z : float float (fix); 
begin 

if x >* 0,0 then 
if z <* y then 
return fix; 
else 

return fix - 1; 
end if; 
else 

if z < y then 

return -1-fix; 
else 

return -fix; 
end if; 
end if; 
end floor; 


function ceil (x : float) return integer is 
begin 

return -floor (-x) ; 
end ceil; 


function trunc (x : float) return integer is 
begin 

if x >■ 0.0 then 
return floor (x); 
else 

return ceil(x); 
end if; 
end trunc; 
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function SineR(x : float) return float is 
x2 : float :« x * x; 
begin 

return x + x * x2 * ((sine(O) * x2 + sine(l)) * x2 + sine (2)); 
end SineR; 


function CosineR(x : float) return float is 
x2 : float x * x; 

begin 

return 1 . 0 + x2 * (-0.5 + x2 * 

( (cosine (0) * x2 + cosine (1)) * x2 + cosine(2))); 

end CosineR; 


procedure ReduceTrigArg { x : float; 

yy : out float; 

jj : out integer) is 

y, whi, wlo : float; 

j : integer; 

z : float :« abs x; 

begin 

j :* truncfz / HalfPi); 

empy (float (j) , 0.0, HalfPi, HalfPiLo, whi, wlo); 
y :* (z - whi) - wlo; 
if y > FourthPi then 

y (y - HalfPi) - HalfPiLo; 

j :■ j + l; 

end if; 

if x < 0.0 then 
yy -y; 

jj s- ( 4- j mod 4) mod 4; 

else 

yy :« y; 
jj :« j mod 4; 
end if; 

end ReduceTrigArg; 


function sin(x : float) return float is 
y : float; 
j : integer; 
begin 

if abs x >- 16777216.0 then 
raise domain_error; 
end if; 

ReduceTrigArg (x, y, j ) ; 
case j is 

when 0 «> return SineR (y) ; 
when 1 -> return CosineR (y); 
when 2 «> return -SineR (y ); 
when 3 «> return -CosineR (y) ; 
when others -> null; 
end case; 
end sin; 
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function cos <x: float) return float is 
y : float; 
j : integer; 
begin 

if abs x >» 16777216.0 then 
raise domain_error; 
end if; 

ReduceTrigArg(x, y, j); 
case j is 

when 0 ■> return CosineR(y); 

when 1 *> return -SineR(y) ; 

when 2 «> return -CosineR(y); 

when 3 ~> return SineR(y); 

when others -> null; 
end case; 
end cos; 


function exp(x : float) return float is 
n : integer; 

fhi, 
f lo, 

y* 

y2, 

z : float; 

begin 

if x >* domain then 

raise domain_error; — overflow will occur 
elsif x o -domain then 
return 0.0; 
end if; 

n integer(x / ln2hi) ; 
y :■ x - float (n) * ln2hi; 
if y > 0.5 * ln2hi then 
n ;* n + 1; 

elsif y < -0.5 * ln2hi then 
n : * n-1; 
end if; 

Empy (float (n) f 0.0, ln2hi, ln21o, fhi, f lo) ; 
y :■ (x - fhi) - flo; 
y2 y * y; 

z y - y2 * (ExpR(0) * y2 + ExpR(l)); 
return (1.0+y+y**/ (2.0 - z) ) *2.0** 
end exp; 


integer (n) ; 
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function log{x : float) return float is 
y f In, 

v, v2 : ^loat; 
n : integer; 

begin 

if x <- 0 . 0 then 

raise domain_error; 
end if; 

Strip (x, r) ; 
n : - r . e ; 

y Assemble ((r.s,0,r.f)); 
if y > sqrt2 then 
y :■ y * 0.5; 

n n + 1; 

elsif y < 0.5 * sqrt2 then 
y y + y; 

n n - 1; 

end if; 

v :- ( y - 1.0) / ( y + 1.0) ; 
v2 :■ v*v; 

In f loat (n) ; 

return ( ( ( (v2 / (LogR(O) * v2 + LogR(l)) + (1.0 - y) ) * v 
+ ln21o * In) + (y - 1.0)) + In * ln2hi) ; 

end log; 
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function atan (x : float) return float is 
Yt 

y2, 

ylo, 

arctan : float; 
begin 

if abs x <- tanpiby!2 then 
y x; 
y2 :« y * y; 
arctan q(y, y2) ; 
return arctan; 

elsif abs x in tanpibyl2 1.0 then 

empy (abs (x) , 0.0, sqrt3, sqrt31o, y, ylo); 
y :* ((y-1.0) + ylo) / (sqrt3 + abs x) ; 
y2 :» y * y; 
arctan :• q(y f y2) ; 
arctan sixthpi + arctan; 
if x >■ 0.0 then 
return arctan; 
else 

return -arctan; 
end if; 

elsif abs x in 1.0 .. 1 . 0/tanpibyl2 then 

y (sqrt3 - abs x) / {1.0 + sqrt3 * abs x) 
y2 := y * y; 
arctan q(y,y2) ; 
arctan ;■ thirdpi - arctan; 
if x >■ 0.0 then 
return arctan; 
else 

return -arctan; 
end if; 

else 

y :*= 1.0 / x; 
y2 :« y * y; 
arctan :* q(y/y2) ; 
if x >= 0.0 then 

return (HalfPi - arctan) + HalfPiLo; 
else 

return -(HalfPi + arctan); 
end if; 

end if; 
end atan; 
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function atan2 (x, y : float) return float is 
z : float; 

begin 

if y ■ 0.0 then 

if x >■ 0.0 then 
return HalfPi; 
else 

return -HalfPi; 
end if; 
end if; 
z :■ x/y; 
if y > 0.0 then 

return atan(z); 

else 

if x > 0.0 then 

return (2.0 * HalfPi + atan(z)) + 2.0 * HalfPiLo; 
else 

return -((2.0 * HalfPi - atan(z)) + 2.0 * HalfPiLo); 
end if; 
end if; 
end atan2; 


function sqrt(x : float) return float is 
y : float; 

begin 

if x <0.0 then 

raise domain_error; 
elsif x * 0.0 then 
return 0.0; 
end if; 

Strip (x, r) ; 

y :« Assemble ( (r. s, 0, r.f)); 
if r.fe mod 2-0 then 

y :» (0.5 * y + Ceven) * 2.0 ** integer (r .e/2) ; 
else 

y (0.25 * y + Codd) * 2.0 ** integer((r.e - 1) / 2); 
end if; 

y 0 .5 * (y+x/y); 

y 0 .5 * (y+x/y) ; 

return y - 0.5 * (y - x / y) ; 
end sqrt; 


function xmod(x, y : float) return float is 
z : float; 
begin 

z x - float (trunc (x/y) ) * y; 
if z < 0.0 then 
z z + abs y; 
end if; 
return z; 
end xmod; 

end MathLib; 
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